Token based cache-coherence protocol

ABSTRACT

A cache coherence mechanism for a shared memory computer architecture employs tokens to designate a particular node's rights with respect to writing or reading a block of shared memory. The token system provides a correctness substrate to which a number of performance protocols may be freely added.

CROSS-REFERENCE TO RELATED APPLICATIONS

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

FIELD OF THE INVENTION

[0001] The present invention relates generally to a system for coordinating cache memories in a computing system.

BACKGROUND OF THE INVENTION

[0002] Large computer software applications, such as simulators and database servers, require cost-effective computation beyond that which can be provided by a single microprocessor. Shared-memory, multiprocessor computers have emerged as a popular solution for running such applications. Most shared memory multiprocessor computers provide each constituent processor with a cache memory into which portions of the shared memory (“blocks”) may be loaded. The cache memory allows faster memory access.

[0003] A cache coherence protocol ensures that the contents of the cache memories accurately reflect the contents of the shared memory. Generally, such protocols invalidate all other caches when one cache is written to, and update the main memory before a changed cache is flushed.

[0004] Two important classes of protocols for maintaining cache coherence are “directories” and “snooping”. In the directory protocols, a given “node”, typically being a cache/processor combination, “unicasts” its request for a block of memory to a directory which maintains information indicating those other nodes using that particular memory block. The directory then “multicasts” requests for that block directly to a limited number of indicated nodes. Generally, the multicast will be to a superset of the nodes greater than the number that actually have ownership or sharing privileges because of transactions which are not recorded in the directory, as is understood in the art. The “indirection” of directory protocols, requiring messages exchanged with the directory prior to communication between processors, limits the speed of directory protocols.

[0005] The problem of indirection is avoided in snooping protocols where a given cache may “broadcast” a request for a block of memory to all other “nodes” in the system. The nodes include all other caches and the shared memory itself. The node “owning” that block responds directly to the requesting node, forwarding the desired block of memory.

[0006] Snooping, however, requires that “message ordering” be preserved on the interconnection between communicating nodes. Generally this means each node can unambiguously determine the logical order in which all messages must be processed. This has been traditionally guaranteed by a shared wire bus. Without such ordering, for example, a first node may ask for a writeable copy of a block held by memory at the same time that it sends messages to other nodes invalidating their copies of the block in cache for reading. A second node receiving the invalidation message may ignore it because the second node does not have the block, but then the second node may request the block for reading before the first node receives the block from memory for writing. When the first node finally does receive the block, the second node erroneously believes it has a readable copy.

[0007] The “correctness” of memory access in snooping is tightly linked to this requirement of message ordering in the communications between processors. This and other requirements of the snooping protocol complicate any modifications of snooping to increase its performance.

BRIEF SUMMARY OF THE INVENTION

[0008] In the invention, memory access is controlled by “tokens” whose number is globally “known” and whose possession by a node simply and intuitively designates the state of a node's cache blocks. Generally speaking, a node having all the tokens for a block may write to or read from the block, a node having at least one token but less than all tokens may only read from the block, and a node having no tokens can neither write to nor read from the block.
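
This token rule can be stated in a few lines of code. The following Python sketch is illustrative only; the names T, can_write, and can_read are assumptions and not part of any described embodiment:

    T = 4  # total tokens for a given block, globally known to all nodes

    def can_write(tokens_held: int) -> bool:
        return tokens_held == T   # writing requires holding all T tokens

    def can_read(tokens_held: int) -> bool:
        return tokens_held >= 1   # reading requires at least one token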

[0009] By and large, this system provides certainty in the “correctness” of memory access independent of most other aspects of the cache coherence protocol. The invention thereby provides a robust foundation (a “correctness substrate”) on which a variety of other performance-enhancing protocol steps may be readily added.

[0010] Specifically, the present invention provides a shared memory computer architecture having at least two processor units (each having a processor and cache), a shared memory, and an interconnect allowing communication between the processor units and the shared memory. The invention also provides cache management circuitry operating to: (i) establish a set of tokens of known number; (ii) allow a processor to write to at least a portion of the shared memory through its cache only if it has all the tokens for that portion; and (iii) allow a processor to read from at least a portion of the shared memory through its cache only if it has at least one of the tokens for that portion.

[0011] Thus, it is one object of the invention to provide a simple and intuitive protocol for coordinating memory access in a shared memory computer system.

[0012] The cache management circuitry may be distributed among the processor units and the memory.

[0013] Thus, it is another object of the invention to provide an architecture that may work with a variety of different architecture models including “glueless” architectures in which most circuitry is contained in a replicated, elemental building block.

[0014] The cache management circuitry may respond to a request by a processor unit to write to a portion of shared memory by sending to other processor units a write request for that portion. The cache management circuitry may further respond to the write request at a receiving processor having at least one token for a portion, to send all tokens for that portion held by the receiving processor to the requesting processor.

[0015] Thus, it is an object of the invention to provide a simple method of transferring cache write permission.

[0016] The request may be broadcast to all other processor units.

[0017] Thus, it is another object of the invention to provide a simple broadcast-based protocol. Notwithstanding this object, the present invention may also work with multicast transmissions to conserve bandwidth and thus improve performance.

[0018] One token may be an “owner” token, and the cache management circuitry responding to the write request may send the portion of the shared memory held by the receiving processor and the tokens to the requesting processor only when the receiving processor holds the owner token. Receiving processor units not having the owner token also send their tokens but need not send the portion of shared memory.

[0019] Thus, it is an object of the invention to reduce interconnect data traffic. Processor units which are not owners may transmit their tokens without data, knowing that the owner will transmit that data.

[0020] The cache management circuitry may alternatively respond to a read request by sending to other processor units a read request message, and the cache management circuitry may respond to the read request message at receiving processors having at least one token to send at least one token for that portion held by the receiving processor to the requesting processor. In a preferred embodiment, typically only one token is sent.

[0021] It is thus another object of the invention to minimize the unnecessary movement of tokens. On the other hand, multiple tokens may be sent if it is predicted that the receiving processing unit may need write permission shortly.

[0022] When the receiving processor has the owner token, the cache management circuit may send a token that is not the owner token unless the receiving processor has only one token.

[0023] Thus, it is one object of the invention to avoid unnecessary transmission of the ownership token, which normally must be accompanied by the data of the requested portion of shared memory.

[0024] The cache management circuitry may respond to a predetermined failure of a requesting processor to obtain tokens by retransmitting to other processors a request for the portion after a back-off time. The back-off time may be randomized and/or increased for each retransmission.

[0025] Thus it is another object of the invention to reduce situations where a processor unit does not promptly get the tokens, permission and/or data it is seeking. By repeating the request after a back-off time, collisions may be efficiently avoided in most cases.

[0026] The cache management circuitry may respond to a predetermined failure of a requesting processor to obtain tokens by transmitting to other processors a persistent request requiring the other processor to forward tokens for that portion of shared memory until a deactivation message is received, and wherein the requesting processor allows a deactivation signal only after receiving the necessary tokens. The cache management circuitry responds to the persistent request to send any necessary tokens for the portion held or received by the receiving processor between the occurrence of the persistent request and the deactivation signal.

[0027] Thus, it is another object of the invention to provide for a mechanism that assures no starvation of a given processor.

[0028] When multiple requesting processors fail to obtain tokens, the cache management circuitry may select one processor unit to benefit from a persistent request and then a second after the first has completed its token acquisition.

[0029] Thus, it is another object of the invention to allow the imposition of an arbitration mechanism in the case of conflicts between processor units.

[0030] The cache management circuitry may select the order of service of the multiple requesting processors to minimize the communication burden between successive multiple processors.

[0031] Thus, it is another object of the invention to provide a mechanism for more sophisticated resolution of conflicting memory requests based on minimizing data transmission time or costs.

[0032] The interconnect may be an unordered interconnect.

[0033] It is thus a further object of the invention to provide a cache coherence protocol that does not require the hardware overhead and complexity of a message-ordered interconnect.

[0034] These particular objects and advantages may apply to only some embodiments falling within the claims and thus do not define the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0035] FIG. 1 is a block diagram of a multiprocessor, shared-memory computer system having sets of processor units, including a processor and cache, communicating on a network with a common shared memory;

[0036] FIG. 2 is a detailed block diagram of a processor unit showing the processor, cache, and a portion of the cache controller circuitry in turn having a token table and a persistent request table;

[0037] FIG. 3 is a representation of token flow between processor units and the shared memory required for a processor to read shared memory;

[0038] FIG. 4 is a figure similar to that of FIG. 3 showing token flow between processor units and the shared memory required for a processor to write shared memory;

[0039] FIG. 5 is a flow chart of the steps executed by the cache controller circuitry when a processor unit cannot obtain desired tokens within a predetermined period of time;

[0040] FIG. 6 is a table showing the response of a processor unit to different requests by other processor units for tokens as implemented by the cache control circuitry;

[0041] FIG. 7 is a figure similar to that of FIGS. 3 and 4 showing the flow of persistent request and deactivation messages when token transfer is delayed more than a predetermined amount; and

[0042] FIG. 8 is a persistent request arbitration table that may be implemented in the processor units to allow for a more sophisticated arbitration without a central directory.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

System Elements

[0043] Referring now to FIG. 1, a multiprocessor, shared-memory computer system 10 may include a number of processor units 12 communicating via an interconnect 14 with a shared memory 16. The processor units 12 and shared memory 16 will be referred to collectively as “nodes”. Cache management circuitry 18 communicates with the processor units 12 and the shared memory 16 to control access by the processor units 12 to the shared memory 16. The cache management circuitry 18 may be distributed among the nodes and/or may have centralized components to be compatible with a wide variety of computer architectures.

[0044] Referring still to FIG. 1, the shared memory 16 may be, for example, high speed solid state memory and provides a common storage area for data used by all the processor units 12. Although the shared memory 16 is depicted as a unitary structure, in practice, the shared memory 16 may be distributed over the interconnect 14 or even among the different processor units 12.

[0045] The interconnect 14 may be, for example, a parallel bus structure or a serial network and may have a tiered structure, as shown, generally reflecting differences in communication speed between processor units 12. For example, the processor units 12 may be organized into clusters, here labeled P₀-P₃ for a first cluster and P₄-P₇ for a second cluster. Communications within a cluster may be faster than communications between clusters and, for this reason, each of the processor units 12 may be assigned an identification number generally reflecting its relative proximity to other processor units 12. Closer numbers can indicate closer proximities and this information may be used to optimize data transfer as will be described below. The interconnect 14 may use a virtual network to avoid deadlocks, as is understood in the art.

[0046] Referring to FIG. 2, each processor unit 12 includes a processor 20 communicating with one or more cache levels (shown for clarity as a single cache 22). The cache 22 is typically divided into a number of blocks 24 representing convenient units of data transfer between the shared memory 16 and the processor units 12. The cache 22 and processor 20 communicate via an internal bus 28 with a cache controller 26, being part of the cache management circuitry 18, which in turn connects to the interconnect 14.

[0047] Generally, the cache controller 26 will operate to move blocks of the shared memory 16 into the cache 22 for rapid access (reading and writing) by the processor 20. The cache controller 26 will then hold the block, transfer it to another processor unit 12, or, if the block must be evicted, return the block to shared memory 16. As will be described in greater detail below, the cache controller performs these operations using a set of tokens that may be passed among the nodes by messages on the interconnect 14. Generally, token possession maps to traditional cache coherence states: a node having all T tokens for a given cache block 24 holds the block in a modified (M) state, a node having one to T-1 tokens holds the block in a shared (S) state, and a node having no tokens holds the block in an invalid (I) state. Each of these states will be recognized by one of ordinary skill in the art. Through the use of tokens, correctness in data access is ensured without the need for detailed knowledge about stable and transient protocol states, data acknowledgement messages, and interconnect and/or system hierarchy.
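
The mapping from token count to these traditional states may be sketched as follows (Python; the function name is an illustrative assumption):

    def coherence_state(tokens_held: int, T: int) -> str:
        if tokens_held == T:
            return "M"  # modified: all T tokens, read and write permitted
        if tokens_held >= 1:
            return "S"  # shared: one to T-1 tokens, read-only
        return "I"      # invalid: no tokens, no access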

[0048] In accomplishing its task, the cache controller 26 employs a token table 30 providing, effectively, one row 32 for each block 24 of the cache 22. A third column of each row 32 indicates the number of tokens held by the processor unit 12 for a particular block 24. It is through this token table 30 that tokens are “held” by a processor unit 12 after being transmitted between the processor units 12 and/or the shared memory 16 over the interconnect 14. This information about the number of tokens is linked to a valid bit in a first column of the row 32 and an owner bit in a second column of the row 32. The owner bit is set when one of the tokens held is a system-unique owner token as will be described below. The valid bit indicates that the data of the block 24 associated with the tokens of the row 32 is valid; it is not required in the simplest version of the protocol. In this more complex version using a valid bit, it is possible to hold tokens without valid data of the block 24. This can be useful if a data-less message arrives with a token prior to arrival of other messages with tokens and the necessary data.
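
A possible in-memory representation of one row 32 of the token table 30, following the three columns just described, is sketched below in Python (the field names are illustrative assumptions):

    from dataclasses import dataclass

    @dataclass
    class TokenTableRow:
        valid: bool = False   # first column: data of the block is valid (optional)
        owner: bool = False   # second column: a held token is the owner token
        tokens: int = 0       # third column: number of tokens held for the block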

[0049] Shared memory 16 also has a token table 30 (not shown) so it can acquire and share tokens. Initially all tokens are held by the shared memory 16.

[0050] Each node also includes or shares a persistent request table 34 providing, for example, a number of logical rows 33 equal to the number of nodes in the multiprocessor, shared-memory computer system 10. The cache controller 26 and shared memory 16 can thus have access to the persistent request table 34. Each row 33 is identified to a node by a node number in a first column. A second column of each row 33 identifies a particular block 24, if any, for which the node is making a persistent request. The use of a persistent request will be described below.
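
The persistent request table 34 may be sketched as a simple mapping from node number to the block under request, as below (NUM_NODES and the dictionary representation are illustrative assumptions):

    NUM_NODES = 4  # assumed system size, for illustration only

    # one logical row 33 per node: node number -> block under persistent
    # request, or None when the second column of the row is empty
    persistent_request_table = {node: None for node in range(NUM_NODES)}

    persistent_request_table[0] = 24  # e.g., node 0 persistently requests block 24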

A Request to Read Shared Memory

[0051] Referring now generally to FIGS. 2 and 3, the cache management circuitry 18 initially establishes a set of tokens that will be transmitted between nodes requesting read or write permissions for the shared memory 16. The tokens may be fixed in number or another mechanism may be adopted so that all components know the total number of tokens. No exclusively local action may change the number of tokens without eventual global communication of that change. The tokens are transmitted as specific data patterns and have no physical embodiment. The tokens are transmitted and control the processor units according to the following invariants enforced by the cache management circuitry 18.

[0052] Invariant I: At all times each cache block 24 has an established number of tokens. Optionally, and as will be described here, one token may be the owner token. Each cache block 24 may have a different number of tokens so long as this number is known globally.

[0053] Invariant II: A node can write a block 24 only if it holds all T tokens for that block 24.

[0054] Invariant III: A node can read a block 24 only if it holds at least one token for that block 24. Optionally, and as will be described here, the node may also need to check to see that it has valid data by checking the valid data bit.

[0055] Invariant IV: If a cache coherence message contains data of a block 24, it must contain at least one token.

[0056] Invariant V: If a cache coherence message contains one or more tokens, it must contain data of the block. Optionally, and as will be described here, the data need only be sent if the message contains the owner token.

[0057] These invariants are sufficient to ensure correctness of memory access and require, at a minimum, T undifferentiated tokens for each block. The number of tokens may desirably be greater than the number of nodes without upsetting the correctness provided by the token system. A greater number of tokens addresses the fact that some tokens will be in transit between nodes and allows a greater freedom in reading the shared memory 16, such as may be desired in certain architectures exhibiting some types of timing constraints. With some loss in performance, a number of tokens less than the number of nodes may also be used.
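
A minimal sketch of a checker for invariants IV and V, under the optional owner-token variant of invariant V described above, is given below (the CoherenceMessage structure is an illustrative assumption):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CoherenceMessage:
        tokens: int              # tokens carried by the message
        has_owner_token: bool    # whether the owner token is among them
        data: Optional[bytes]    # block data, or None for a data-less message

    def obeys_invariants(msg: CoherenceMessage) -> bool:
        if msg.data is not None and msg.tokens < 1:
            return False  # invariant IV: data must travel with at least one token
        if msg.has_owner_token and msg.data is None:
            return False  # invariant V (owner variant): owner token carries data
        return True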

[0058] An optional improvement in efficiency of transfer of blocks 24 between processor units 12 may be obtained by the addition of one differentiated token called the “owner” token. The owner token may be transmitted over the interconnect 14 and recorded in the token table 30 by the setting of the owner bit as has been described above. In the following examples, it will be assumed that an owner token is used; however, it will be understood that the owner token is not required for correctness. Thus, the owner token is simply a performance-enhancing feature of a type that may be grafted onto the correctness substrate provided by the tokens. Generally, the owner token carries with it a responsibility not to discard the data and to be the node to transmit the data when it is requested.

[0059] Referring now to FIG. 3, in a simple memory access example, a given processor unit P₀ may need to read a particular block 24 of shared memory 16. As an initial matter, it will be assumed that the block 24 is held in the shared memory 16 and the four tokens 40 associated with each of the nodes of the processor units 12 and shared memory 16 are initially held at shared memory 16.

[0060] Per invariant III, the processor unit P₀ cannot read the block 24 from its cache 22 until it has at least one token 40. Accordingly, the processor unit P₀ (via its cache controller 26) transmits a read message 36 requesting tokens over the interconnect 14 in broadcast fashion to each of the remaining nodes of processor units P₁ and P₂ and shared memory 16. This broadcast does not require the processor unit P₀ to know the node at which valid data of the block 24 is held.

[0061] In an alternative embodiment, the broadcasting described herein may be a unicast or a multicast based on predictions of the location of the tokens. Such predictions may be based on an observation of historical movement of the tokens or imperfect monitoring of token location through the transmitted messages. As will be understood from this description, the token system ensures data correctness even in the event of incorrect predictions.

[0062] Referring to FIG. 6, upon receipt of the read messages by the nodes, a set of standard responses enforced by the cache controller 26 will occur. The table of FIG. 6 describes generally four possible states of the receiving node (for a read request) as determined by the tokens 40 it holds. The receiving node may have no tokens 40 as indicated by the first column; some tokens 40 but no owner token 40 as indicated by the second column; some tokens 40 but not all the tokens 40 and the owner token 40 as indicated by the third column; and all the tokens 40 as indicated by the fourth column.

[0063] In the example of FIG. 3, processor units P₁ and P₂ each have no tokens 40 for the block 24, so a request for read of the block 24 will cause the processor units P₁ and P₂ to ignore the message as indicated by the response of the first column of the table of FIG. 6. This response may, under certain circumstances, provide for an acknowledgement message, but no data is transmitted because processor units P₁ and P₂ do not have valid block data or tokens 40.

[0064] If processor units P₁ or P₂ had tokens 40 but not the owner token 40, per the second column of the table of FIG. 6, they would also not respond, knowing that the node with the owner token 40 will respond per the third column of the table of FIG. 6. If processor units P₁ or P₂ had less than all the tokens 40 and the owner token 40, per the third column of the table of FIG. 6, they would respond with the data of the block 24 and a token 40, but optionally not the owner token 40 unless that was all they had. A programmed reluctance to give up the owner token 40 is one way to enhance performance by minimizing transfer of ownership unless there is a compelling reason to do so. If the node has only the owner token, then it must send the owner token.

[0065] Referring again to the example of FIG. 3, in contrast to processor units P₁ and P₂, shared memory 16 has valid data of the block 24, indicated by the existence of at least one token 40 in the token table 30 of the shared memory 16. Accordingly, the shared memory 16 responds with one token 40′ in a reply message 44 to processor unit P₀ per the fourth column of the table of FIG. 6. Because shared memory 16 has the owner token 40 (indicated by a star next to the token symbol of FIG. 3), the shared memory will also send the data 42 of the block 24 requested per invariant V. The use of the owner token 40 in this case is intended to eliminate the need for several nodes which have tokens 40 to all send duplicative data 42. Interconnect traffic is significantly reduced through the use of the owner token 40 as described. Note that the shared memory 16 does not send the owner token 40.

[0066] In a performance-enhanced version of the response of column four of the table of FIG. 6, when a read request is received by processor unit P₁, for example, holding all of the tokens 40, the processor unit P₁ sends all tokens 40 to the requesting node processor unit P₀ if a write was recently completed by the processor unit P₁. This rule accommodates migratory data sharing patterns well known to those of ordinary skill in the art. In the case where the reading of the block has not been completed at processor unit P₁, only one token 40 is sent, and preferably not the owner token 40, under the assumption that a read or a write at processor unit P₁ will be forthcoming and less data will ultimately need to be transmitted back to processor unit P₁.

[0067] Briefly, if no owner token 40 were used, the second column of the table of FIG. 6 would be omitted and all nodes would send a token 40 and data 42.
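
The read-request responses of the four columns of FIG. 6, in the simple (non-enhanced) variant used in this example, may be sketched as follows (the return structure and names are illustrative assumptions):

    from typing import Optional

    def respond_to_read(tokens_held: int, has_owner: bool) -> Optional[dict]:
        if tokens_held == 0:
            return None  # column 1: ignore; nothing to contribute
        if not has_owner:
            return None  # column 2: stay silent; the owner node will reply
        # columns 3 and 4: the owner replies with the data and one token,
        # parting with the owner token only when it is the sole token held
        return {"data": True, "tokens_sent": 1, "owner_sent": tokens_held == 1}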

[0068] Referring still to the example of FIG. 3, at the conclusion of this read request, processor unit P₀ has a single token 40 and the data of the block 24 from the shared memory 16 and thus may read the block that it desires.

A Request to Write to Shared Memory

[0069] Referring now to FIG. 4, two processor units P₀ and P₁ may each initially have one token 40 and the shared memory 16 may initially have two tokens 40. In the event that the third processor unit P₂ requests write access to a block 24 represented by those tokens 40, processor unit P₂ will broadcast write requests 46 to each of the other nodes of processor units P₀ and P₁ and shared memory 16. Referring to the first column of the table of FIG. 6, any node having no token 40 may simply ignore this request. However, processor units P₀ and P₁ each have one token 40, and thus, per the second column of the table of FIG. 6, will reply by sending all their tokens 40 in a reply message 48. In this case, the shared memory 16 has the owner token 40 and so, under the third column of the table of FIG. 6, the shared memory 16 sends all its tokens 40 and the necessary data of the block 24. The same result would be obtained if the shared memory 16 had all tokens 40 and thus implicitly the ownership token 40.
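
The corresponding write-request responses of FIG. 6 may be sketched as follows (again, the return structure is an illustrative assumption): every token holder yields all of its tokens, and only the holder of the owner token must also supply the data.

    from typing import Optional

    def respond_to_write(tokens_held: int, has_owner: bool) -> Optional[dict]:
        if tokens_held == 0:
            return None  # no tokens: the request may simply be ignored
        # all tokens held go to the requester; data travels with the owner token
        return {"data": has_owner, "tokens_sent": tokens_held,
                "owner_sent": has_owner}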

[0070] At any time, because of the non-ordered nature of the interconnect 14, a node may receive tokens 40 that are not expected. In order to accommodate possible limits in data storage at the nodes, unwanted tokens 40 and data may be resent by the node, typically to the shared memory 16, to avoid the need for local storage. Additionally, when storage space is required in any node, that node may, on its own initiative, send its tokens 40 to the shared memory 16 to free up space. Only the node having the owner token 40 carries with it a duty to send the actual data. In implementations where an owner token 40 is not used, data associated with each token 40 must be transmitted by the node when it evicts the tokens 40.

[0071] More sophisticated protocols than those shown in FIG. 6 may be used to enhance performance over the correctness substrate provided. For example, write or read requests may be predictively limited to subsets of the nodes where the data is expected to be found to reduce bandwidth on the interconnect 14. Correctness is ultimately ensured by the tokens 40, independent of the accuracy of the predictions as to where the tokens may be found.

Token Access Guarantees

[0072] It will be understood, from the above, that the passing of the tokens 40 provides a definitive indication of the rights of each node to access a block of the shared memory 16. However, the particular protocols, as defined by the numbered invariants above and shown in the table of FIG. 6, do not ensure that a given node will ever get the necessary tokens 40. “Starvation” may occur, for example, when two competing nodes both requiring write access are repeatedly interrupted in their token gathering by each other or a third node requesting read access. Thus, as a practical matter, the issue of memory access “starvation” must also be addressed, ensuring that a given node requesting access ultimately does get the access in a reasonably timely manner.

[0073] The present invention provides two methods of dealing with access starvation; however, it is contemplated that other methods may also be used and several methods may be combined.

[0074] Referring to FIG. 5, the cache management circuitry 18 of each processor unit 12 may monitor token requests indicated by process block 50 at that processor unit 12. After a predetermined period of time has elapsed without receipt of the requested tokens 40 for reading or writing to shared memory 16, as indicated by the loop formed with decision block 52, the cache controller 26 may delay for a back-off time per block 56 and reissue the request for the token 40 indicated by process block 54. The back-off time may be a randomly selected time period within a range which increases for each invocation of the back-off time block 56, for example, like the back-off time used in communication protocols like Ethernet. The back-off time may, for example, be twice the average miss latency and may adapt to average miss latency on a dynamic basis.

[0075] This back-off time and repeated request per process blocks 56 and 54 may be repeated for a given number of times, for example, four times, per decision block 56 and the loop formed thereby.
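
The randomized, increasing back-off of blocks 54 and 56 may be sketched as follows (the tokens_acquired and reissue_request callables are illustrative assumptions standing in for the cache controller's hooks):

    import random
    import time

    def retry_with_backoff(tokens_acquired, reissue_request,
                           base_delay: float, max_attempts: int = 4) -> bool:
        # base_delay might be twice the average miss latency, per the text
        ceiling = base_delay
        for _ in range(max_attempts):
            if tokens_acquired():
                return True
            time.sleep(random.uniform(0.0, ceiling))  # randomized back-off delay
            ceiling *= 2.0        # range increases on each invocation
            reissue_request()     # process block 54: resend the token request
        return tokens_acquired()  # False: escalate to a persistent request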

[0076] After completion of the timeout period implemented by the decision block 56, if the tokens 40 have not been received so that the necessary read or write request may be completed, a persistent request may be initiated as indicated by process block 58.

[0077] Generally, “persistent” requests persist at all nodes (i.e., processor units 12 and memory 16). All nodes remember that tokens (currently held or that arrive in the future) for a given block B (subject to the persistent request) should be forwarded to processor P (making the persistent request). To limit the amount of state that needs to be remembered, each processor is limited to K persistent requests, bounding the number of persistent requests in the system (and thus the number of entries in the table 34) to N*K. K is likely to be a small constant, and may be K=1.

[0078] There are two methods that may be used to implement a persistent request. The first method requires a central arbiter such as the memory 16, although different blocks may have different arbiters so long as each node 12 can identify the arbiter for a particular block. This approach requires indirection of persistent request message transmission, first to the arbiter and then to other nodes. The second method is “distributed” and does not require this indirection.

[0079] Referring to FIG. 7, in the first method, the persistent request message 60 may be transmitted, for example, from the first processor unit P₀ to the shared memory 16, the latter providing a central location to deal with possible multiple persistent requests for the same block from different nodes. The shared memory 16 thus may prioritize the requests so that only one persistent request message for a given block may be serviced at one time.

[0080] Assuming that the particular processor unit P₀ initiating a persistent request is seeking access to a block 24 that is not subject to any other persistent requests, then the shared memory 16 (for example, as the home node for that block) submits an activation message 62 to all other nodes and to the requesting processor unit P₀. Other subsequent persistent requests for that block are queued by the shared memory 16.

[0081] Referring to FIG. 2, when each node receives the activation message 62, it enrolls the identification of the processor unit (P₀) making a request in the persistence table 34 along with the identification of the block 24 for which the persistent request is being made. From this point onward, so long as the entry is in table 34, the node will forward the token 40 to the requesting processor unit (P₀) indicated in the first column of the table 34, whether the node currently has the token 40 or receives the token 40 subsequently. As has been discussed, data is forwarded only if the token 40 is the owner token. Each processor unit P₀ through P₃ is responsible for invoking no more than a limited number of persistent requests at a time, thus limiting the size of the persistence tables 34 of each node shown in FIG. 2.

[0082] When the requesting processor unit 12 (P₀) has completed the memory access underlying the persistent request, that requesting processor unit (P₀) forwards a deactivation message 66 to the shared memory 16, which broadcasts the deactivation message 68 to all processor units 12. Upon receipt of the deactivation message 68, each node deletes the entry in the node's persistence table 34. The shared memory 16 may then activate another persistent request for that block from its queued persistent requests according to a preselected arbitration scheme, most simply, according to the next persistent request in queue.
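
The centralized arbitration of paragraphs [0079] through [0082], with one active persistent request per block, later requests queued, and the next request activated upon deactivation, may be sketched as follows (class and callback names are illustrative assumptions):

    from collections import deque

    class HomeArbiter:
        def __init__(self, broadcast_activation, broadcast_deactivation):
            self.broadcast_activation = broadcast_activation
            self.broadcast_deactivation = broadcast_deactivation
            self.active = {}   # block -> node with the active persistent request
            self.queued = {}   # block -> deque of waiting requesters

        def persistent_request(self, block, node):
            if block not in self.active:   # one active request per block
                self.active[block] = node
                self.broadcast_activation(block, node)
            else:                          # later requests wait their turn
                self.queued.setdefault(block, deque()).append(node)

        def deactivate(self, block):
            self.broadcast_deactivation(block, self.active.pop(block))
            if self.queued.get(block):     # most simply: next request in queue
                nxt = self.queued[block].popleft()
                self.active[block] = nxt
                self.broadcast_activation(block, nxt)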

[0083] More specifically, point-to-point order on the interconnect 14 or explicit acknowledgement messages can be used to handle races where activations/deactivations can cross each other in the interconnect 14. The sender does not send the next activation or deactivation message until it has received all the acknowledgement messages for the prior such message, thus preventing reorderings. As will be known to one skilled in the art, there are many alternative solutions, such as using point-to-point ordering in the interconnection network to enforce in-order delivery or using message sequence numbers to detect and recover from message reorderings.

[0084] In the second, decentralized method of handling persistent requests, each processor unit 12 directly broadcasts its persistent requests to all other nodes in the system 10. These nodes allocate an entry in their table 34 for this request. If two processor units 12 both issue persistent requests for the same block, all processor units 12 in the system must arbitrate to determine who should receive the tokens. This arbitration may be done by statically assigning a priority based on a numerical identification number previously assigned. Referring now to FIG. 8, for this purpose, each individual node may replace persistence table 34 with persistence table 70, similar to persistence table 34, listing persistent requests made by other nodes but not yet activated. The processor units 12 monitoring this table 70 may activate one such request on a global basis by following a common rule. For example, the rule may be that the next node in line for activation of its persistent request will be the node with the lowest numerical identification (described above) of the contesting nodes. This works in the presence of races, since two nodes may temporarily disagree on which node is the lowest, but eventually all nodes will agree and forward the tokens to the lowest numbered node.

[0085] Once a processor unit 12 is no longer starving, it deactivates persistent requests by broadcasting a deactivation to all nodes, which clear the entry in their tables 70. To prevent the highest priority processor from starving other processors, the system must be careful as to when processors are allowed to issue subsequent persistent requests. For example, if a processor is allowed to issue a persistent request immediately, it may starve other processors, and if a processor is required to wait until its table is empty, other processors can starve it. In a preferred embodiment, when a processor unit 12 completes a persistent request, it marks each entry for the block currently in its table 70. This processor unit 12 must wait until all of the marked entries have been ‘deactivated’ and removed from the table 70 before issuing another persistent request for that block.

[0086] In other words, when a node completes a persistent request for an address A, it marks all persistent requests in its table 70 that match address A (adding a “pending bit” (not shown) to table 70). Before issuing a persistent request for address A, a processor unit must consult its local table 70. If an address A matches AND the pending bit is set for that entry, then this is a second persistent request which must stall. Otherwise, it may proceed.
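
The distributed arbitration rule and the pending-bit stall just described may be sketched as follows (the table and set representations are illustrative assumptions):

    def winner(table_70: dict, block: int):
        # common rule: the lowest-numbered contender for the block is served first
        contenders = table_70.get(block, set())
        return min(contenders) if contenders else None

    def may_reissue(marked_blocks: set, block: int) -> bool:
        # a new persistent request for a block stalls while entries marked
        # with the pending bit remain in the local table 70
        return block not in marked_blocks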

[0087] Referring again to FIG. 1, the use of an arbitration system that looks at numerical identifications ensures the data is first passed preferably within clusters of nodes, thus reducing data transit time. This implementation of persistent requests can be performed in a distributed fashion within the nodes and thus does not require a central directory-type structure or the resulting indirection of message transfer, and can be implemented in so-called glueless systems where additional processor units 12 may be combined with minimal glue logic. Again, these features are not critical to the core correctness substrate provided by the tokens 40 of the present invention. As described, these approaches both use broadcast of the persistent request messages, but one could use a multicast to a predicted set of ‘active’ processors before resorting to broadcast, enhancing the scalability of the invention.

[0088] Empirically, the present inventors have determined that with most memory access requests, tokens 40 will be obtained immediately or via the back-off and request of process blocks 56 and 54 without the need for a persistent request message. Nevertheless, the indirection of communicating a persistent request message via the shared memory (or other designated node) introduces considerable delay in the transfer of data and may be desirably avoided by using a second, more sophisticated approach.

[0089] The above-described token-based system for cache control clearly need not be implemented on a single integrated circuit but is broadly applicable to any cache system where multiple processing units compete for access to common memory, and thus the present invention can be used in systems having network-connected processing units including but not limited to Internet caching systems. Clearly, the invention can be implemented in hardware, firmware, or software or a combination of all three.

[0090] It is specifically intended that the present invention not be limited to the embodiments and illustrations contained herein, but that modified forms of those embodiments including portions of the embodiments and combinations of elements of different embodiments also be included as come within the scope of the following claims.

We claim:
1. A computer system comprising: a) at least two processor units each having at least one processor and at least one cache; b) a shared collection of data; c) a communication channel allowing communication between the processor units and the shared collection of data; d) cache management means operating to: i) establish a set of tokens; ii) allow a processor to write to at least a portion of the shared collection of data through its cache only if the processor has all the tokens for that portion; and iii) allow a processor to read from at least a portion of the shared collection of data through its cache only if the processor has at least one of the tokens for that portion.

2. The computer system of claim 1 wherein the number of tokens is no less than the number of processor units.
3. The computer system of claim 1 further including the step of responding to a request by a requesting processor to write to a portion of memory by sending to other processors a request message for write privileges for the portion of memory; and wherein the cache management means responds to the request message by a receiving processor having at least one token by sending all tokens for that portion held by the receiving processor to the requesting processor.
4. The computer system of claim 3 wherein the cache management means broadcasts the request message to the other processors.

5. The computer system of claim 3 wherein one token is an owner token and wherein the cache management means responds to the request message to send the portion held by the receiving processor to the requesting processor only when the receiving processor holds the owner token.

6. The computer system of claim 3 wherein the cache management means responds to the request message without sending the portion held by the receiving processor to the requesting processor when the receiving processor does not hold the owner token.
7. The computer system of claim 1 further including the step of responding to a request by a requesting processor to read a portion of memory by sending to the other processors a request message for read privileges for the portion of memory and wherein the cache management means responds to the request message received by a receiving processor having at least one token to send at least one token for the portion held by the receiving processor to the requesting processor.
8. The computer system of claim 7 wherein the cache management means broadcasts the request message to the other processors.
9. The computer system of claim 7 wherein the cache management means sends only one token for that portion.
10. The computer system of claim 7 wherein one token is an owner token and wherein the cache management means responds to the request message to send a token other than the owner token for the portion unless the receiving processor has only one token and then sending the owner token for the portion.
11. The computer system of claim 7 wherein one token is an owner token and wherein the cache management means responds to the request message received by a receiving processor having all the tokens to send a token for the portion that is not the owner token unless the receiving processor has completed a writing to the portion and then sending all tokens for the portion to the requesting processor.
12. The computer system of claim 1 wherein the cache management means coordinates the transfer of tokens between processor units according to requests by the processor units to access the shared collection of data by transmitting token requests and wherein the cache management means responds to a predefined failure of a requesting processor to obtain tokens by retransmitting a token request after a predetermined back-off time.
13. The computer system of claim 12 wherein the back-off time is randomized.
14. The computer system of claim 12 wherein the retransmission is repeated a predetermined number of times with increasing length of back-off time.
15. The computer system of claim 1 wherein the cache management means coordinates the transfer of tokens between processor units according to requests by the processor units to access the shared collection of data by transmitting token requests and wherein the cache management means responds to a predefined failure of a requesting processor to obtain tokens by prioritizing token requests.
16. The computer system of claim 3 wherein the cache management means prioritizes token requests by sending to other processors a persistent activation signal requiring the other processor to forward tokens for that portion to the requesting processor until a deactivation message is received; and wherein the cache management means responds to the persistent activation signal received by a receiving processor to send all tokens for the portion held or received by the receiving processor between the occurrence of the persistent activation message and the deactivation signal.
17. The computer system of claim 16 wherein the cache management means responds to a predetermined failure of multiple requesting processors to obtain tokens by broadcasting the persistent activation signal of one of the requesting processors at a time according to a predetermined arbitration rule.
18. The computer system of claim 17 wherein the predetermined arbitration rule selects sending of persistent activation signals to minimize the communication costs of transmitting data between successive ones of the multiple requesting processors.
19. The computer system of claim 1 wherein the cache management means is distributed among the processor units and the memory.
20. The computer system of claim 1 where the interconnect is an unordered interconnect.
21. A method of operating a computer system having: a) at least two processor units each having a processor and cache; b) a shared collection of data; and c) an interconnect allowing communication between the processor units and the shared collection of data; comprising the steps of: i) establishing a set of tokens no less in number than the number of processor units accessing the shared collection of data; ii) allowing a processor to write to at least a portion of the shared collection of data through its cache only if the processor has all the tokens for that portion; and iii) allowing a processor to read from at least a portion of the shared collection of data through its cache only if the processor has at least one of the tokens for that portion.
22. The method of claim 21 further including the steps of: responding to a request by a requesting processor to write to a portion of memory by sending to other processors a request message for write privileges for the portion of memory; and responding to the request message by a receiving processor having at least one token by sending all tokens for that portion held by the receiving processor to the requesting processor.
23. The method of claim 22 wherein the request message is broadcast to the other processors.
24. The method of claim 22 wherein one token is an owner token and further including the step of responding to the request message to send the portion held by the receiving processor to the requesting processor only when the receiving processor holds the owner token.
25. The method of claim 22 further including the step of responding to the request message without sending the portion held by the receiving processor to the requesting processor when the receiving processor does not hold the owner token.
26. The method of claim 21 further including the steps of responding to a request by a requesting processor to read a portion of memory by sending to the other processors a request message for read privileges for the portion of memory and responding to the request message received by a receiving processor having at least one token to send at least one token for the portion held by the receiving processor to the requesting processor.
27. The method of claim 26 wherein the request message is broadcast to the other processors.
28. The method of claim 26 wherein only one token for that portion is sent.
29. The method of claim 26 wherein one token is an owner token and further including the step of responding to the request message to send a token other than the owner token for the portion unless the receiving processor has only one token and then sending the owner token for the portion.
30. The method of claim 26 wherein one token is an owner token and further including the step of responding to the request message received by a receiving processor having all the tokens to send a token for the portion that is not the owner token unless the receiving processor has completed a writing to the portion and then sending all tokens for the portion to the requesting processor.
31. The method of claim 21 further including the step of coordinating the transfer of tokens between processor units according to requests by the processor units to access the shared collection of data by transmitting token requests and responding to a predefined failure of a requesting processor to obtain tokens by retransmitting a token request after a predetermined back-off time.

32. The method of claim 31 wherein the back-off time is randomized.
33. The method of claim 31 wherein the retransmission is repeated a predetermined number of times with increasing length of back-off time.

34. The method of claim 21 further including the step of coordinating the transfer of tokens between processor units according to requests by the processor units to access the shared collection of data by transmitting token requests and responding to a predefined failure of a requesting processor to obtain tokens by broadcasting to other processors a persistent activation signal requiring the other processor to forward tokens for that portion to the requesting processor until a deactivation message is received; and further including the step of responding to the persistent activation signal received by a receiving processor to send all tokens for the portion held or received by the receiving processor between the occurrence of the persistent activation message and the deactivation signal.
35. The method of claim 34 further including the step of responding to a predetermined failure of multiple requesting processors to obtain tokens by broadcasting the persistent activation signal of one of the requesting processors at a time according to a predetermined arbitration rule.
36. The method of claim 35 wherein the predetermined arbitration rule selects sending of persistent activation signals to minimize the communication costs of transmitting data between successive ones of the multiple requesting processors.